Even after the controversy presidential results came out, the debate on social media has never ended. One of the most prominent effect of the communication happens on social media especially on Twitter.
In order to figure out what Internet users are talking about and how they feel during and post the election, we planned to collect historical data during the intervals of three presidential debates. If we could acquire the historical data during those different periods of time, we would be able to compare the information and people’s attitude towards the presidency as time went on. However, the major issue we encountered is that Twitter API only allows us to access user data from very recent seven days for confidential reasons. As a result, we changed our focus to recent data to find out what users are talking about in a recent timeframe and analyzed the sentiments and popularity of tweets.
In this project, we will mainly focus on three parts:
The statistical relations within retweet data and sentiment score generated from tweet contents using Shapiro.test, Hypothesis Test and Anova Table. We look forward to find out the relationship between sentiment and retweet number, and further compare how different hashtags will influence the user sentiments.
The frequency of words mentioned by trump supporters and protestants using Word Cloud.
Visualization of sentiments towards Trump among different hashtags (#DonaldTrump, #MAGA, #NeverTrump) and different locations (California, Florida, Great Lakes Area) integrated with shiny application to create an interactive map on tweet popularity.
In this project, all our datasets come from Twitter. We planned to explore how the attitude of the public towards Donald Trump has changed through different time periods and across different locations the US at the beginning. We use the function filterstream() to search for tweets in 6 typical states, 2 each for red, blue and swing states during the 2016 president election with hashtag "@RealDonaldTrump OR #DonaldTrump“. However, when we later reviewed those tweets we got, we found that only a few of them are actually related with the topic we chose.
To avoid this situation, we use the searchTwitter() function instead to collect tweets under three topics, which are "@RealDonaldTrump OR #DonaldTrump“,”#MakeAmericaGreatAgain“, and”#NeverTrump“. By using searchTwitter() function, we cannot get users’ locations directly, so we use the function lookupUsers() provided by Twitter API to get each user’s location then transfer into geocode by google map API. We try to rerun the search process to get a large dataset throughout the whole process to make our further exploration more actuate, however, due to the limitation on Twitter API to look up for users’ locations (we each are forbidden from the token access after the first search) and the 2500 data per computer by google map API, we can only get 10,000 data. Moreover, since those data includes all tweets around the world, when we decrease the scope to the US, we are finalized with 4728 data in total with locations within the US.
To scale the attitude of the public for quantitative analysis, we write a function called sentiment_scores() to score the sentiment of each tweets based on words in the text. We compare those words with two dictionaries that include most positive and negative sentiment words to see how strong the sentiment is and whether it is a positive sentiment or a negative one.
positives = readLines("positive-words.txt")
negatives = readLines("negative-words.txt")
sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){
scores = laply(tweets,
function(tweets, positive_words, negative_words){
tweets = gsub("[[:punct:]]", "", tweets) # remove punctuation
tweets = gsub("[[:cntrl:]]", "", tweets) # remove control characters
tweets = gsub('\\+', '', tweets) # remove digits
# Let's have error handling function when trying tolower
tryTolower = function(x){
# create missing value
y = NA
# tryCatch error
try_error = tryCatch(tolower(x), error=function(e) e)
# if not an error
if (!inherits(try_error, "error"))
y = tolower(x)
# result
return(y)
}
# use tryTolower with sapply
tweets = sapply(tweets, tryTolower)
# split sentence into words with str_split function from stringr package
word_list = str_split(tweets, "\\s+")
words = unlist(word_list)
# compare words to the dictionaries of positive & negative terms
positive.matches = match(words, positive_words)
negative.matches = match(words, negative_words)
# get the position of the matched term or NA
# we just want a TRUE/FALSE
positive_matches <- !is.na(positive.matches)
negative_matches <- !is.na(negative.matches)
# final score
score = sum(positive_matches) - sum(negative_matches)
return(score)
}, positive_words, negative_words, .progress=.progress)
return(scores)
}
After scoring tweets under each topic, we draw histograms to see the distribution as a simple check whether the result matches our selection (the negative topic will have more negative sentiments). We also add a column of the absolute value of those sentiment scores for comparing only the strength of sentiment in the later analysis.
After we finalizing our data frames by only showing useful and related information, we get 3 frames for each topic and a total one, including screenname, text, retweetCount, score, absolute_score, lon, and lat. For convenience, we save those frames throughout the process and we will just load those frames here to conduct further explorations.
data3 <- read.csv("never.csv",row.names = 1)
data2 <- read.csv("maga.csv", row.names = 1)
data1 <- read.csv("trump.csv",row.names = 1)
total <- rbind(data1, data2, data3)
To have a general understanding on what are the most popular words that people use in tweets to express their thoughts regard the election, we do wordclouds under each topic and the total dataframe.
library(tm)
## Warning: package 'tm' was built under R version 3.3.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.3.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.3.2
## Loading required package: RColorBrewer
# function to clean text for future preparation
clean.text <- function(some_txt)
{
some_txt = gsub("&", "", some_txt)
some_txt = gsub("(RT|via)((?:\b\\W*@\\w+)+)", "", some_txt)
some_txt = gsub("@\\w+", "", some_txt)
some_txt = gsub("[[:punct:]]", "", some_txt)
some_txt = gsub("[[:digit:]]", "", some_txt)
some_txt = gsub("http\\w+", "", some_txt)
some_txt = gsub("[ t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.tolower = function(x)
{
y = NA
try_error = tryCatch(tolower(x), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(x)
return(y)
}
some_txt = sapply(some_txt, try.tolower)
some_txt = some_txt[some_txt != ""]
names(some_txt) = NULL
return(some_txt)
}
# get cleaned tweets
tweets1 <- gettext(data1$text)
tweets2 <- gettext(data2$text)
tweets3 <- gettext(data3$text)
tweets4 <- gettext(total$text)
tweetwordcloud <- function(tweets){
clean_text = clean.text(tweets)
tweet_corpus = Corpus(VectorSource(clean_text))
tdm = TermDocumentMatrix(tweet_corpus, control = list(removePunctuation = TRUE,stopwords = c("donald", "trump", "real", "nevertrump",stopwords("english")), removeNumbers = TRUE, tolower = TRUE))
m = as.matrix(tdm) #we define tdm as matrix
word_freqs = sort(rowSums(m), decreasing=TRUE) #now we get the word orders in decreasing order
dm = data.frame(word=names(word_freqs), freq=word_freqs) #we create our data set
y <- wordcloud(dm$word, dm$freq,scale=c(5,0.1), max.words=200, min=2,random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors = brewer.pal(8, "Dark2")) #and we visualize our data
return(y)
}
wordclound1 <- tweetwordcloud(tweets1) # word cloud under topic "Donald Trump"
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : presideninheadministration could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : americafirsdraintheswamp could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtgivescushion could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : immigrationmaga could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : michigan could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtestablishmendefieson could not be fit on page. It will not be
## plotted.
wordclound2 <- tweetwordcloud(tweets2) # word cloud under topic "MAGA"
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : makeamericagreatagain could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : earliesmakeamericagreatagain could not be fit on page. It will not
## be plotted.
wordclound3 <- tweetwordcloud(tweets3) # word cloud under topic "Never Trump"
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : generationso could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rememberhahe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtalways could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : satisfyheir could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : never could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : candidate could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : evan could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : mcmullin could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : case could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtnevertrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : havenconnectedhe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : jusyereadhis could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : said could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtin could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : like could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtthe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : russian could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : american<U+9225><U+FFFD> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : intelligence could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : theresistance could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : america could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : trumps could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : miromney could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtyoureellinghiso could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : saidheyre could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : theyknow could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : whyhey could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : curschilling could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : malkin could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : means could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : michelle could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : putin could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : russia could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : conspiracy could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : hired could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtwhatever could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : stop could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : election could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : itakes could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : ofhe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : community could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : hillary could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : party could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : started could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : uniteblue could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : pushed could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : respondsohe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : theory could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : gop could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : iwillvote could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtvideotrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : trumpleaks could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : face could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : notmypresiden could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : president could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtdonald could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : russianhacking could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : exactly could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : fascisakeoverhis could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : legitimate could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : lincoln could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : notmypresidennevertrump could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : whaiwould could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : <U+980E><U+7704><U+72BF><U+980E><U+7704><U+72BF> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : nolegitimate could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : notmypresidenusa could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : republican could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : traito<U+9225> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : dumptrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : resisnevertrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rtthis could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : rttrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : thedailyliberal could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : greanation could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : iamyub could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : notmypresidentheresistance could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : donthecon could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : representswohings could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : republicans could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : conservative could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : grabyourwalleresisttrump could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : havehe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : potus could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : thedailylibera<U+9225> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : women could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : anything could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : aheakeover could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : dismantling could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : electoralcollege could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : good could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : interestsmultinational could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : looking could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : usa could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : versionlikeshe could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : abourigh could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : againstrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : corporatio<U+9225> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : goingo could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : gopguardians could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : imwithher could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : intoleranceracism could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : maga could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : pooraste could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : anymore<U+9225> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : briefings could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : cleveland could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : curious could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : h<U+9225> could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : joinedogether could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : make could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : need could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : neverhillary could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : nochoosinghahey could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : notmypresidenelection<U+9225> could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : trumptapes could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : upseaboutrump could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : washington could not be fit on page. It will not be plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : auditthevote could not be fit on page. It will not be plotted.
wordclound4 <- tweetwordcloud(tweets4) # word cloud under the total dataset
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : makeamericagreatagain could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(dm$word, dm$freq, scale = c(5, 0.1), max.words =
## 200, : earliesmakeamericagreatagain could not be fit on page. It will not
## be plotted.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.3.2
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(broom)
library(scales)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(stringr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
d3 = read.csv("NEVER.csv", row.names = 1)
d2 = read.csv("MAGA.csv", row.names = 1)
d4 = read.csv("total.csv", row.names = 1)
# use unnest_tokens function to remove some "stopwords"
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- d4 %>%
filter(!str_detect(text, '^"')) %>%
mutate(tweets = text) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
#find the most common words in tweets
commonword <- tweet_words %>%
dplyr::count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_bar(stat = "identity") +
ylab("Occurrences") +
coord_flip()
bing <- get_sentiments("bing")
bing_word_counts <- tweet_words %>%
inner_join(bing) %>%
dplyr::count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts <- bing_word_counts[-1, ]
#remove the top key word "trump"
bing_word_counts %>%
filter(n > 20) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment")
## Warning: Stacking not well defined when ymin != 0
Given such an interesting dataset of 4728 tweets, we have many curious questions for ourselves, for example:
Before answering the aforementioned questions, we are going to start out by summarize the data:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#add a column name "hashtag"
d3$hashtag=rep("NEVERTRUMP", 1279)
d2$hashtag=rep("MAGA", 1238)
# combine the two datasets into one for comparison
d5 <- rbind(d3, d2)
score.response = na.omit(d4$score)
retweet <- d4$retweetCount
summary(score.response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.0000 0.0000 0.0000 0.3196 1.0000 7.0000
summary(retweet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 2.0 602.8 87.0 153900.0
Q&A
Q: Is the Sentiment Data Normal? Method: Shapiro-Wilk normality test, confidence level= 95%; QQplot Test statistics: If p<0.05, then the data is not normal.
## Shapiro.test, confidence level= 95%. If p < 0.05, our data deviates from deviation.
data_score= d4$score
shapiro.test(data_score)
##
## Shapiro-Wilk normality test
##
## data: data_score
## W = 0.88763, p-value < 2.2e-16
## Plot using a qqplot
qqnorm(data_score);qqline(data_score, col = 2)
A: This p-value tells us what the chances are that the sample comes from a normal distribution. In this case, because p<0.05, we can conclude that the data fails to be a normal distribution. To help us visualize it, we also used a QQplot. The fact that most of the data deviates from the straight line also confirms our conclusion.
For someone notorious as Trump, it is easy to believe that people still show some negative emotions towards him, even after the election. To confirm our assumption, we conducted a Hypothesis Test.
Q: After the election, do people start to have neutral attitude towards trump? Ho: score = 0 The feelings are neutral. Ha: score = -1 People are still slightly repelled by Trump. Confidence Interval = 95% Test statistic: If t is not in the range of critical values, reject.
## Hypothesis Testing (Ho: People are repulsed by Trump even after election. score = -1 )
alpha=0.05
mu0 = -1 # hypothesized value
t = (mean(score.response) - mu0)/(sd(score.response)/sqrt(4728))
t # test statistic
## [1] 70.71645
#compute the critical values at .05 significance leveL
t.half.alpha = qt(1 - alpha/2, df= 4728 - 1)
cv<- c(-t.half.alpha, t.half.alpha)
cv
## [1] -1.960466 1.960466
A: The test statistic 70.72 doesn’t lie between the critical values -1.960, and 1.960. Hence, at .05 significance level, we reject the null hypothesis and conclude that people still have negative emotions towards trump.
Q: Do tweets with stronger attitude (both positive & negative) get retweeted more? Confidence level: 95% Test Statistic: If p < 0.05, conclude that there is a linear relationship.
# Anova Table -- analyze the relationship between sentiment and retweet number
total = read.csv("total.csv")
aov.out = aov(total$retweetCount~total$score, data=total)
A: Because p = 2.86e-13<0.05, we are confident to say that retweet number is related to sentiments.
Q: How does the plot look like?
ggplot(data = total, aes(x = absolute_score, y = retweetCount)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_continuous(limits = c(0, 10000))
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
A: The slope of the line is rather small because many tweets don’t get retweeted. However, there is still a linear relationship. The stronger the sentiment, the higher possibility it might be tweeted. For now, there are limited number of high scores. The accuracy of plot would improve if we have a huge dataset.
To conduct this analysis, we combined our dataset of #NEVERTRUMP and #MAKEAMERICAGREATAGAIN to see if there is any difference in the retweet number between #NeverTrump and #MAGA groups.
Q: For the two hashtag groups, is the mean retweet number the same for the population? Confidence Interval: 95% Test Statistics: p<0.05, conclude that their means are different.
# compare the retweet mean difference across the 10 score groups from -5 to 5
results <- aov(retweetCount ~ factor(hashtag), data=d5)
summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(hashtag) 1 25649247 25649247 117.6 <2e-16 ***
## Residuals 2515 548364310 218037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.lm(results)
##
## Call:
## aov(formula = retweetCount ~ factor(hashtag), data = d5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -328.2 -305.2 -125.3 2.7 4659.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 328.20 13.27 24.73 <2e-16 ***
## factor(hashtag)NEVERTRUMP -201.92 18.62 -10.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 466.9 on 2515 degrees of freedom
## Multiple R-squared: 0.04468, Adjusted R-squared: 0.0443
## F-statistic: 117.6 on 1 and 2515 DF, p-value: < 2.2e-16
A: Because p < 2e-16 <0.05, there are significant differences between #NeverTrump and #MAGA groups. Tweets with hashtag#MAGA get retweeted more than #NEVERTRUMP.
Q: For the two hashtag groups, is the mean emotion level the same for the population? Confidence Interval: 95% Test Statistics: p<0.05, conclude that their means are different.
# compare the score mean difference between the two hashtag groups
results <- aov(score ~ factor(hashtag), data=d5)
summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(hashtag) 1 120 119.55 74.03 <2e-16 ***
## Residuals 2515 4062 1.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.lm(results)
##
## Call:
## aov(formula = score ~ factor(hashtag), data = d5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1876 -0.6236 -0.1876 0.8124 4.3764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.62359 0.03612 17.266 <2e-16 ***
## factor(hashtag)NEVERTRUMP -0.43594 0.05067 -8.604 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.271 on 2515 degrees of freedom
## Multiple R-squared: 0.02859, Adjusted R-squared: 0.02821
## F-statistic: 74.03 on 1 and 2515 DF, p-value: < 2.2e-16
A: Because p < 2e-16 <0.05, there are significant differences between #NeverTrump and #MAGA groups. Tweets with hashtag#MAGA have stronger emotions than #NeverTrump.
The main purpose of mapping is to compare and analyze the distribution of twitter users who posted tweets on our selected topics. We plan to assign different point size to different level of sentiment with a range from 0 to 7, so that both strong negative and positive opinions would be relatively more noticeable on the maps. A sentiment score of -5 shows a strong negative emotion; while a score of 7 shows an extreme positive emotion. To distinguish negative sentiments from positive ones, we plan to apply two different colors, red and blue, based on its sentiment scores.
Firstly, we create a world map using ggmap. The tweet points on the map are clustered on the areas of North America and Europe, suggesting that most twitters who posted tweets about Donald Trump are located in these regions.
Then, to view the distribution of these tweets in United States only, we create a country-wide map. From the map it clearly shows that there are more users located on the east coast and the Great Lakes.
library(ggmap)
# mapping with sentiment scores
usa_center <- as.numeric(geocode("United States"))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false
USAMap = ggmap(get_googlemap(center=usa_center, scale=2, zoom=4), extent="device")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.09024,-95.712891&zoom=4&size=640x640&scale=2&maptype=terrain&sensor=false
# mapping with all data
MAP4 <- USAMap +
geom_point(aes(x=lon, y=lat), col=ifelse(((total$score>=0)),"brown1", "blue"), data=total, alpha=0.4, size=total$absolute_score) +
scale_size_continuous(range=total$score)+
ggtitle("U.S Mapping under the Total Dataset")
MAP4
## Warning: Removed 6 rows containing missing values (geom_point).
Since it is hard to distinguish the distribution of positive and negative opinions, we then create three more maps based on different tweet topics: "@RealDonaldTrump OR #DonaldTrump“,”#MakeAmericaGreatAgain“, and”#NeverTrump“.
# mapping under topic "trump"
MAP <- USAMap +
geom_point(aes(x=lon, y=lat), col=ifelse(((data1$score>=0)),"brown1", "blue"), data=data1, alpha=0.4, size=data1$absolute_score) +
scale_size_continuous(range=data1$score)+
ggtitle("U.S Mapping under #DonaldTrump")
MAP
## Warning: Removed 5 rows containing missing values (geom_point).
# mapping under topic "make america great again"
MAP2 <- USAMap +
geom_point(aes(x=lon, y=lat), col=ifelse(((data2$score>=0)),"brown1", "blue"), data=data2, alpha=0.4, size=data2$absolute_score) +
scale_size_continuous(range=data2$score)+
ggtitle("U.S Mapping under #MakeAmericaGreatAgain")
MAP2
## Warning: Removed 1 rows containing missing values (geom_point).
# mapping under topic "never trump"
MAP3 <- USAMap +
geom_point(aes(x=lon, y=lat), col=ifelse(((data3$score>=0)),"brown1", "blue"), data=data3, alpha=0.4, size=data3$absolute_score) +
scale_size_continuous(range=data3$score)+
ggtitle("U.S Mapping under #NeverTrump")
MAP3
In our conjecture, #DonaldTrump is a neutral topic. Its map verified our opinion by showing a comparatively more mixed-sentimental distribution.
Besides, there are a lot more positive tweets (red spots) on the map of #MakeAmericaGreatAgain than on the other two topic maps, which totally makes sense since this topic has a more positive attitude. Also on this map, lots of red points lie on the east coast of America, representing that the public on the east coast are more inclined to this topic than people in other regions.
On the contrast, the map of #NeverTrump shows more blue points than the other two topic maps, which are distributed evenly on the map.
After making three topic maps, we decide to further explore the tweets distribution and their sentiment levels for different states. Here we choose three representative states: Florida, California, and Michigan.
# sentiment across different states
florida_center <- as.numeric(geocode("Florida"))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Florida&sensor=false
FLMap = get_googlemap("Florida", zoom=6, maptype = "roadmap", crop = FALSE)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Florida&zoom=6&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Florida&sensor=false
FL <- ggmap(FLMap) +
geom_point(aes(x=lon, y=lat), col=ifelse(((total$score>=0)),"brown1", "blue"), data=total, alpha=0.4, size=total$absolute_score) +
scale_size_continuous(range=total$score)
FL
## Warning: Removed 4224 rows containing missing values (geom_point).
CA_center <- as.numeric(geocode("California"))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false
CAMap <- get_googlemap("California", zoom = 6, maptype = "roadmap", crop = FALSE)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=6&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false
CA <- ggmap(CAMap) +
geom_point(aes(x=lon, y=lat), col=ifelse(((total$score>=0)),"brown1", "blue"), data=total, alpha=0.4, size=total$absolute_score) +
scale_size_continuous(range=total$score)
CA
## Warning: Removed 4129 rows containing missing values (geom_point).
MI_center <- as.numeric(geocode("Michigan"))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Michigan&sensor=false
MIMap <- get_googlemap("Michigan", zoom = 6, maptype = "roadmap", crop = FALSE)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Michigan&zoom=6&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Michigan&sensor=false
MI <- ggmap(MIMap) +
geom_point(aes(x=lon, y=lat), col=ifelse(((total$score>=0)),"brown1", "blue"), data=total, alpha=0.4, size=total$absolute_score) +
scale_size_continuous(range=total$score)
MI
## Warning: Removed 4222 rows containing missing values (geom_point).
Florida, a deep-red state, shows a map of nearly all positive opinions toward Donald Trump. We can see that the size of red points on this map are fairly large, meaning that these positive opinions are relatively strong and solid.
The map for California contains most negative opinions (blue points) among three states. This makes sense since California voted for Hillary Clinton during the president election period. We can see from the map that there are three main clusters of points, which shows that the twitter users who posted opinions about Donald Trump are mainly in big cities, such as San Francisco, Los Angeles, Fresno and San Diego.
For Michigan, the map shows a lot of positive opinions as well as some negative ones, though the sentiment levels of positive opinions are overall much stronger than the ones of negative opinions.
As for the Shiny application, we created an interactive map that easier to use compared to static map in ggplot. For this shiny app, we found it would be more inspiring to focus on the popularity of the content. As a result, we used the raw data from number of retweets counts of each tweets we generated that with greater number of retweets, we considered that individual tweet more popular. In the map, the deeper the color of the popup points, the more popular the tweet content is.
By utilizing the Shiny features that users can easily interact with the size and content of the map, we plotted the map on the overall United States on the beginning. Once users could get access to the map, the shiny feature enable the users to explore the data based on their preference. Choices are open while users can zoom the scale of the map and click on every individual points to find out more detail about that tweet. Upgraded from ggmap experience, Shiny offers a great opportunities for better data visualization and interaction.
Moreover, we created the shiny app to create a interactive word cloud app. In this app, moving from static word cloud, we integrated and enabled the choices in the frequency of words appearance. Moreover, we created three different panel choices for users to select three different hashtags thus to be presented with different word cloud images.
During the creative process, we confronted problems with error in reactive functions and as publishing the app to share. Even though our code are still immature in some of the part concerning this shiny app especially with the word cloud, we hope that we could solve most of our problems in the future exploration with Shiny.
As we are going forward in the future, we are looking forward to improve and expand our project in following several aspects.
To find the optimal ways to collect data from real-time service data website. Because of the API limit of twitter and google, we got limited access to a larger data set. As we noticed during the process, a larger dataset will not only saturate our project with more factual evidences, but also will offer a wider range of choices in our analysis as a whole. To explore and figure out a better way to present the popularity of tweets based on various geographic areas in the States. We think it would be beneficial to plot a heatmap based on different states if we can extract the specific states location from our data. To compare current result in a larger scope of context. We look forward to analyze more on the similarity between social media reaction and actual election results if we will have access to historical data.